Abstract


We propose combinatorial cascading bandits, a class of partial monitoring problems
where at each step a learning agent chooses a tuple of ground items subject
to constraints and receives a reward if and only if the weights of all chosen items 
are one. The weights of the items are binary, stochastic, and drawn independently 
of each other. The agent observes the index of the first chosen item whose weight 
is zero. This observation model arises in network routing, for instance, where the 
learning agent may only observe the first link in the routing path which is down, 
and blocks the path. We propose a UCB-like algorithm for solving our problems,
CombCascade, and prove gap-dependent and gap-free upper bounds on its n-step
regret. Our proofs build on recent work in stochastic combinatorial semi-bandits 
but also address two novel challenges of our setting, a non-linear reward function 
and partial observability. We evaluate CombCascade on two real-world problems 
and show that it performs well even when our modeling assumptions are violated. 
We also demonstrate that our setting requires a new learning algorithm. 


1 Introduction 

Combinatorial optimization [16] has many real-world applications. In this work, we study a class of 
combinatorial optimization problems with a binary objective function that returns one if and only if 
the weights of all chosen items are one. The weights of the items are binary, stochastic, and drawn 
independently of each other. Many popular optimization problems can be formulated in our setting. 
Network routing is a problem of choosing a routing path in a computer network that maximizes the 
probability that all links in the chosen path are up. Recommendation is a problem of choosing a list 
of items that minimizes the probability that none of the recommended items are attractive. Both of 
these problems are closely related and can be solved using similar techniques (Section 2.3). 

Combinatorial cascading bandits are a novel framework for online learning of the aforementioned 
problems where the distribution over the weights of items is unknown. Our goal is to maximize the 
expected cumulative reward of a learning agent in n steps. Our learning problem is challenging for 
two main reasons. First, the reward function is non-linear in the weights of chosen items. Second, 
we only observe the index of the first chosen item with a zero weight. This kind of feedback arises 
frequently in network routing, for instance, where the learning agent may only observe the first link 
in the routing path which is down, and blocks the path. This feedback model was recently proposed 
in the so-called cascading bandits [10]. The main difference in our work is that the feasible set can 
be arbitrary. The feasible set in cascading bandits is a uniform matroid. 






Stochastic online learning with combinatorial actions has been previously studied with semi-bandit
feedback and a linear reward function [8, 11, 12], and its monotone transformation [5]. Established
algorithms for multi-armed bandits, such as UCB1 [3], KL-UCB [9], and Thompson sampling [18, 2],
can usually be adapted easily to stochastic combinatorial semi-bandits. However, it is non-trivial to
show that the algorithms are statistically efficient, in the sense that their regret matches some lower
bound. Kveton et al. [12] recently showed this for CombUCB1, a form of UCB1. Our analysis builds
on this recent advance but also addresses two novel challenges of our problem, a non-linear reward
function and partial observability. These challenges cannot be addressed straightforwardly based on
Kveton et al. [12, 10].

We make multiple contributions. In Section 2, we define the online learning problem of combinatorial
cascading bandits and propose CombCascade, a variant of UCB1, for solving it. CombCascade
is computationally efficient on any feasible set where a linear function can be optimized efficiently.
A minor-looking improvement to the UCB1 upper confidence bound, which exploits the fact that the
expected weights of items are bounded by one, is necessary in our analysis. In Section 3, we derive
gap-dependent and gap-free upper bounds on the regret of CombCascade, and discuss the tightness
of these bounds. In Section 4, we evaluate CombCascade on two practical problems and show that
the algorithm performs well even when our modeling assumptions are violated. We also show that
CombUCB1 [8, 12] cannot solve some instances of our problem, which highlights the need for a new
learning algorithm.

2 Combinatorial Cascading Bandits 

This section introduces our learning problem, its applications, and also our proposed algorithm. We
discuss the computational complexity of the algorithm and then introduce the so-called disjunctive
variant of our problem. We denote random variables by boldface letters. The cardinality of set $A$ is
$|A|$ and we assume that $\min \emptyset = +\infty$. The binary and operation is denoted by $\wedge$, and the binary or
is $\vee$.

2.1 Setting 

We model our online learning problem as a combinatorial cascading bandit. A combinatorial cascading
bandit is a tuple $B = (E, P, \Theta)$, where $E = \{1, \dots, L\}$ is a finite set of $L$ ground items, $P$
is a probability distribution over a binary hypercube $\{0,1\}^E$, $\Theta \subseteq \Pi^*(E)$, and:

$$\Pi^*(E) = \{(a_1, \dots, a_k) : k \ge 1, \ a_1, \dots, a_k \in E, \ a_i \ne a_j \text{ for any } i \ne j\}$$

is the set of all tuples of distinct items from $E$. We refer to $\Theta$ as the feasible set and to $A \in \Theta$ as a
feasible solution. We abuse our notation and also treat $A$ as the set of items in solution $A$. Without
loss of generality, we assume that the feasible set $\Theta$ covers the ground set, $E = \cup \Theta$.

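Let $(\mathbf{w}_t)_{t=1}^n$ be an i.i.d. sequence of $n$ weights drawn from distribution $P$, where $\mathbf{w}_t \in \{0,1\}^E$. At
time $t$, the learning agent chooses solution $\mathbf{A}_t = (\mathbf{a}_1^t, \dots, \mathbf{a}_{|\mathbf{A}_t|}^t) \in \Theta$ based on its past observations
and then receives a binary reward:

$$\mathbf{r}_t = \min_{e \in \mathbf{A}_t} \mathbf{w}_t(e) = \bigwedge_{e \in \mathbf{A}_t} \mathbf{w}_t(e)$$

as a response to this choice. The reward is one if and only if the weights of all items in $\mathbf{A}_t$ are one.
The key step in our solution and its analysis is that the reward can be expressed as $\mathbf{r}_t = f(\mathbf{A}_t, \mathbf{w}_t)$,
where $f : \Theta \times [0,1]^E \to [0,1]$ is a reward function, which is defined as:

$$f(A, w) = \prod_{e \in A} w(e), \quad A \in \Theta, \ w \in [0,1]^E.$$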


At the end of time $t$, the agent observes the index of the first item in $\mathbf{A}_t$ whose weight is zero, and
$+\infty$ if such an item does not exist. We denote this feedback by $\mathbf{O}_t$ and define it as:

$$\mathbf{O}_t = \min \{1 \le k \le |\mathbf{A}_t| : \mathbf{w}_t(\mathbf{a}_k^t) = 0\}.$$

Note that $\mathbf{O}_t$ fully determines the weights of the first $\min\{\mathbf{O}_t, |\mathbf{A}_t|\}$ items in $\mathbf{A}_t$. In particular:

$$\mathbf{w}_t(\mathbf{a}_k^t) = \mathbb{1}\{k < \mathbf{O}_t\}, \quad k = 1, \dots, \min\{\mathbf{O}_t, |\mathbf{A}_t|\}. \tag{1}$$

Accordingly, we say that item $e$ is observed at time $t$ if $e = \mathbf{a}_k^t$ for some $1 \le k \le \min\{\mathbf{O}_t, |\mathbf{A}_t|\}$.
Note that the order of items in $\mathbf{A}_t$ affects the feedback $\mathbf{O}_t$ but not the reward $\mathbf{r}_t$. This differentiates
our problem from combinatorial semi-bandits.
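The reward and feedback model above can be illustrated with a short sketch. The function names `cascade_feedback` and `observed_weights` are hypothetical and not part of the paper; items are represented as dictionary keys and solutions as tuples, which is an assumption made only for this illustration.

```python
import math

def cascade_feedback(solution, weights):
    """Return (reward, feedback) for one step.

    solution: tuple of item ids in the chosen order.
    weights:  dict item id -> 0 or 1 (the realized weights at this step).
    feedback: 1-based index of the first zero-weight item, or math.inf.
    """
    feedback = math.inf
    for k, item in enumerate(solution, start=1):
        if weights[item] == 0:
            feedback = k
            break
    # The reward is one if and only if no chosen item has weight zero.
    reward = 1 if feedback == math.inf else 0
    return reward, feedback

def observed_weights(solution, feedback):
    """Weights revealed by the feedback, following (1)."""
    last = min(feedback, len(solution))
    return {solution[k - 1]: int(k < feedback) for k in range(1, last + 1)}
```

For instance, with realized weights `{1: 1, 2: 0, 3: 1}`, the orderings `(1, 2, 3)` and `(2, 1, 3)` both yield reward zero but reveal different items, which is the sense in which the order affects only the feedback.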

The goal of our learning agent is to maximize its expected cumulative reward. This is equivalent to
minimizing the expected cumulative regret in $n$ steps:

$$R(n) = \mathbb{E}\left[\sum_{t=1}^n R(\mathbf{A}_t, \mathbf{w}_t)\right],$$

where $R(\mathbf{A}_t, \mathbf{w}_t) = f(A^*, \mathbf{w}_t) - f(\mathbf{A}_t, \mathbf{w}_t)$ is the instantaneous stochastic regret of the agent at
time $t$ and $A^* = \arg\max_{A \in \Theta} \mathbb{E}[f(A, \mathbf{w})]$ is the optimal solution in hindsight of knowing $P$. For
simplicity of exposition, we assume that $A^*$, as a set, is unique.

A major assumption, which simplifies our optimization problem and its learning, is that
the distribution $P$ is factored:

$$P(w) = \prod_{e \in E} P_e(w(e)), \tag{2}$$

where $P_e$ is a Bernoulli distribution with mean $\bar{w}(e)$. We borrow this assumption from the work of
Kveton et al. [10] and it is critical to our results. We would face computational difficulties without
it. Under this assumption, the expected reward of solution $A \in \Theta$, the probability that the weight of
each item in $A$ is one, can be written as $\mathbb{E}[f(A, \mathbf{w})] = f(A, \bar{w})$, and depends only on the expected
weights of individual items in $A$. It follows that:

$$A^* = \arg\max_{A \in \Theta} f(A, \bar{w}).$$

In Section 4, we experiment with two problems that violate our independence assumption. We also
discuss implications of this violation.

Several interesting online learning problems can be formulated as combinatorial cascading bandits. 
Consider the problem of learning routing paths in Simple Mail Transfer Protocol (SMTP) that maximize
the probability of e-mail delivery. The ground set in this problem are all links in the network
and the feasible set are all routing paths. At time $t$, the learning agent chooses routing path $\mathbf{A}_t$ and
observes if the e-mail is delivered. If the e-mail is not delivered, the agent observes the first link in 
the routing path which is down. This kind of information is available in SMTP. The weight of item 
e at time t is an indicator of link e being up at time t. The independence assumption in (2) requires 
that all links fail independently. This assumption is common in the existing network routing models 
[6]. We return to the problem of network routing in Section 4.2. 

2.2 CombCascade Algorithm 

Our proposed algorithm, CombCascade, is described in Algorithm 1. This algorithm belongs to the
family of UCB algorithms. At time $t$, CombCascade operates in three stages. First, it computes the
upper confidence bounds (UCBs) $\mathbf{U}_t \in [0,1]^E$ on the expected weights of all items in $E$. The UCB
of item $e$ at time $t$ is defined as:

$$\mathbf{U}_t(e) = \min\{\hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e) + c_{t-1, \mathbf{T}_{t-1}(e)}, \ 1\}, \tag{3}$$

where $\hat{\mathbf{w}}_s(e)$ is the average of $s$ observed weights of item $e$, $\mathbf{T}_t(e)$ is the number of times that item $e$
is observed in $t$ steps, and $c_{t,s} = \sqrt{(1.5 \log t)/s}$ is the radius of a confidence interval around $\hat{\mathbf{w}}_s(e)$
after $t$ steps such that $\bar{w}(e) \in [\hat{\mathbf{w}}_s(e) - c_{t,s}, \hat{\mathbf{w}}_s(e) + c_{t,s}]$ holds with a high probability. After the
UCBs are computed, CombCascade chooses the optimal solution with respect to these UCBs:

$$\mathbf{A}_t = \arg\max_{A \in \Theta} f(A, \mathbf{U}_t).$$

Finally, CombCascade observes $\mathbf{O}_t$ and updates its estimates of the expected weights based on the
weights of the observed items in (1), for all items $e = \mathbf{a}_k^t$ such that $k \le \mathbf{O}_t$.

For simplicity of exposition, we assume that CombCascade is initialized by one sample $\mathbf{w}_0 \sim P$. If
$\mathbf{w}_0$ is unavailable, we can formulate the problem of obtaining $\mathbf{w}_0$ as an optimization problem on $\Theta$
with a linear objective [12]. The initialization procedure of Kveton et al. [12] tracks observed items
and adaptively chooses solutions with the maximum number of unobserved items. This approach is
computationally efficient on any feasible set $\Theta$ where a linear function can be optimized efficiently.

CombCascade has two attractive properties. First, the algorithm is computationally efficient, in the
sense that $\mathbf{A}_t = \arg\max_{A \in \Theta} \sum_{e \in A} \log(\mathbf{U}_t(e))$ is the problem of maximizing a linear function on






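Algorithm 1 CombCascade for combinatorial cascading bandits.
  // Initialization
  Observe $\mathbf{w}_0 \sim P$
  $\forall e \in E : \mathbf{T}_0(e) \leftarrow 1$
  $\forall e \in E : \hat{\mathbf{w}}_1(e) \leftarrow \mathbf{w}_0(e)$

  for all $t = 1, \dots, n$ do
    // Compute UCBs
    $\forall e \in E : \mathbf{U}_t(e) = \min\{\hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e) + c_{t-1, \mathbf{T}_{t-1}(e)}, \ 1\}$

    // Solve the optimization problem and get feedback
    $\mathbf{A}_t \leftarrow \arg\max_{A \in \Theta} f(A, \mathbf{U}_t)$
    Observe $\mathbf{O}_t \in \{1, \dots, |\mathbf{A}_t|, +\infty\}$

    // Update statistics
    $\forall e \in E : \mathbf{T}_t(e) \leftarrow \mathbf{T}_{t-1}(e)$
    for all $k = 1, \dots, \min\{\mathbf{O}_t, |\mathbf{A}_t|\}$ do
      $e \leftarrow \mathbf{a}_k^t$
      $\mathbf{T}_t(e) \leftarrow \mathbf{T}_t(e) + 1$
      $\hat{\mathbf{w}}_{\mathbf{T}_t(e)}(e) \leftarrow \dfrac{\mathbf{T}_{t-1}(e) \, \hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e) + \mathbb{1}\{k < \mathbf{O}_t\}}{\mathbf{T}_t(e)}$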


$\Theta$. This problem can be solved efficiently for various feasible sets $\Theta$, such as matroids, matchings,
and paths. Second, CombCascade is sample efficient because the UCB of solution $A$, $f(A, \mathbf{U}_t)$, is a
product of the UCBs of all items in $A$, which are estimated separately. The regret of CombCascade
does not depend on $|\Theta|$ and is polynomial in all other quantities of interest.
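The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the feasible set is enumerated explicitly and searched by brute force (an efficient linear optimizer would replace this in practice), the weights are simulated as independent Bernoulli draws, and the confidence radius at $t = 1$ is set to zero because $\log 0$ is undefined. The function name `comb_cascade` and its parameters are hypothetical.

```python
import math
import random

def comb_cascade(feasible, w_bar, n, seed=0):
    """Run a CombCascade-style loop for n steps; return the chosen solutions.

    feasible: list of tuples of item ids (explicitly enumerated, assumption).
    w_bar:    dict item -> Bernoulli mean, used only to simulate weights.
    """
    rng = random.Random(seed)
    items = sorted({e for A in feasible for e in A})

    def draw():
        return {e: int(rng.random() < w_bar[e]) for e in items}

    # Initialization: one sample w_0 ~ P gives T_0(e) = 1 and the first average.
    w0 = draw()
    T = {e: 1 for e in items}
    w_hat = {e: float(w0[e]) for e in items}

    chosen = []
    for t in range(1, n + 1):
        # UCBs as in (3) with radius sqrt(1.5 log(t-1) / T); zero radius at t = 1.
        log_t = math.log(t - 1) if t > 1 else 0.0
        U = {e: min(w_hat[e] + math.sqrt(1.5 * log_t / T[e]), 1.0) for e in items}

        # Brute-force argmax of the product of UCBs (small feasible sets only).
        A = max(feasible, key=lambda S: math.prod(U[e] for e in S))
        chosen.append(A)

        # Cascade feedback: index of the first zero weight, or infinity.
        w = draw()
        O = next((k for k, e in enumerate(A, 1) if w[e] == 0), math.inf)

        # Update the statistics of observed items only.
        for k in range(1, min(O, len(A)) + 1):
            e = A[k - 1]
            w_hat[e] = (T[e] * w_hat[e] + int(k < O)) / (T[e] + 1)
            T[e] += 1
    return chosen
```

On a toy instance such as `feasible = [(1, 2), (3, 4)]` with means `{1: 0.9, 2: 0.9, 3: 0.9, 4: 0.1}`, the loop is expected to settle on `(1, 2)` after a short exploration phase.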

2.3 Disjunctive Objective 

Our reward model is conjunctive: the reward is one if and only if the weights of all chosen items are
one. A natural alternative is a disjunctive model $\mathbf{r}_t = \max_{e \in \mathbf{A}_t} \mathbf{w}_t(e) = \bigvee_{e \in \mathbf{A}_t} \mathbf{w}_t(e)$, where the reward
is one if the weight of any item in $\mathbf{A}_t$ is one. This model arises in recommender systems, where the
recommender is rewarded when the user is satisfied with any recommended item. The feedback $\mathbf{O}_t$
is the index of the first item in $\mathbf{A}_t$ whose weight is one, as in cascading bandits [10].

Let $f_\vee : \Theta \times [0,1]^E \to [0,1]$ be a reward function, which is defined as $f_\vee(A, w) = 1 - \prod_{e \in A} (1 - w(e))$.
Then under the independence assumption in (2), $\mathbb{E}[f_\vee(A, \mathbf{w})] = f_\vee(A, \bar{w})$ and:

$$A^* = \arg\max_{A \in \Theta} f_\vee(A, \bar{w}) = \arg\min_{A \in \Theta} \prod_{e \in A} (1 - \bar{w}(e)) = \arg\min_{A \in \Theta} f(A, 1 - \bar{w}).$$

Therefore, $A^*$ can be learned by a variant of CombCascade where the observations are $1 - \mathbf{w}_t$ and
each UCB $\mathbf{U}_t(e)$ is substituted with a lower confidence bound (LCB) on $1 - \bar{w}(e)$:

$$\mathbf{L}_t(e) = \max\{1 - \hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e) - c_{t-1, \mathbf{T}_{t-1}(e)}, \ 0\}.$$

Let $R(\mathbf{A}_t, \mathbf{w}_t) = f(\mathbf{A}_t, 1 - \mathbf{w}_t) - f(A^*, 1 - \mathbf{w}_t)$ be the instantaneous stochastic regret at time $t$.
Then we can bound the regret of CombCascade as in Theorems 1 and 2. The only difference is that
$\Delta_{e,\min}$ and $f^*$ are redefined as:

$$\Delta_{e,\min} = \min_{A \in \Theta : e \in A, \Delta_A > 0} f(A, 1 - \bar{w}) - f(A^*, 1 - \bar{w}), \quad f^* = f(A^*, 1 - \bar{w}).$$
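The relation between the disjunctive and conjunctive objectives can be checked numerically. The sketch below uses hypothetical helper names (`f_and`, `f_or`, `lcb`) and represents weights as dictionaries; it only illustrates the identity $f_\vee(A, w) = 1 - f(A, 1 - w)$ and the clipped lower confidence bound.

```python
import math

def f_and(A, w):
    """Conjunctive reward function: product of the weights of items in A."""
    return math.prod(w[e] for e in A)

def f_or(A, w):
    """Disjunctive reward function: one minus the product of complements."""
    return 1.0 - math.prod(1.0 - w[e] for e in A)

def lcb(w_hat, radius):
    """Lower confidence bound on 1 - w_bar(e), clipped at zero."""
    return max(1.0 - w_hat - radius, 0.0)

def complement(w):
    """Element-wise 1 - w."""
    return {e: 1.0 - v for e, v in w.items()}
```

With these definitions, maximizing `f_or(A, w)` over a feasible set selects the same solution as minimizing `f_and(A, complement(w))`.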

3 Analysis 

We prove gap-dependent and gap-free upper bounds on the regret of CombCascade in Section 3.1. 
We discuss these bounds in Section 3.2. 

3.1 Upper Bounds 

We define the suboptimality gap of solution $A = (a_1, \dots, a_{|A|})$ as $\Delta_A = f(A^*, \bar{w}) - f(A, \bar{w})$ and
the probability that all items in $A$ are observed as $p_A = \prod_{k=1}^{|A|-1} \bar{w}(a_k)$. For convenience, we define
shorthands $f^* = f(A^*, \bar{w})$ and $p^* = p_{A^*}$. Let $\tilde{E} = E \setminus A^*$ be the set of suboptimal items, the items
that are not in $A^*$. Then the minimum gap associated with suboptimal item $e \in \tilde{E}$ is:

$$\Delta_{e,\min} = f(A^*, \bar{w}) - \max_{A \in \Theta : e \in A, \Delta_A > 0} f(A, \bar{w}).$$

Let $K = \max\{|A| : A \in \Theta\}$ be the maximum number of items in any solution and $f^* > 0$. Then
the regret of CombCascade is bounded as follows.


Theorem 1. The regret of CombCascade is bounded as:

$$R(n) \le \frac{K}{f^*} \sum_{e \in \tilde{E}} \frac{4272}{\Delta_{e,\min}} \log n + \frac{\pi^2}{3} L.$$


Proof. The proof is in Appendix A. The main idea is to reduce our analysis to that of CombUCB1 in
stochastic combinatorial semi-bandits [12]. This reduction is challenging for two reasons. First, our 
reward function is non-linear in the weights of chosen items. Second, we only observe some of the 
chosen items. 


Our analysis can be trivially reduced to semi-bandits by conditioning on the event of observing all
items. In particular, let $\mathcal{H}_t = (\mathbf{A}_1, \mathbf{O}_1, \dots, \mathbf{A}_{t-1}, \mathbf{O}_{t-1}, \mathbf{A}_t)$ be the history of CombCascade up to
choosing solution $\mathbf{A}_t$, the first $t - 1$ observations and $t$ actions. Then we can express the expected
regret at time $t$ conditioned on $\mathcal{H}_t$ as:

$$\mathbb{E}[R(\mathbf{A}_t, \mathbf{w}_t) \mid \mathcal{H}_t] = \mathbb{E}[\Delta_{\mathbf{A}_t} (1/p_{\mathbf{A}_t}) \mathbb{1}\{\Delta_{\mathbf{A}_t} > 0, \ \mathbf{O}_t \ge |\mathbf{A}_t|\} \mid \mathcal{H}_t]$$

and analyze our problem under the assumption that all items in $\mathbf{A}_t$ are observed. This reduction is
problematic because the probability $p_{\mathbf{A}_t}$ can be low, and as a result we get a loose regret bound.

We address this issue by formalizing the following insight into our problem. When $f(A, \bar{w}) \ll f^*$,
CombCascade can distinguish $A$ from $A^*$ without learning the expected weights of all items in $A$.
In particular, CombCascade acts implicitly on the prefixes of suboptimal solutions, and we choose
them in our analysis such that the probability of observing all items in the prefixes is “close” to $f^*$,
and the gaps are “close” to those of the original solutions.

Lemma 1. Let $A = (a_1, \dots, a_{|A|}) \in \Theta$ be a feasible solution and $B_k = (a_1, \dots, a_k)$ be a prefix of
$k \le |A|$ items of $A$. Then $k$ can be set such that $\Delta_{B_k} \ge \frac{1}{2} \Delta_A$ and $p_{B_k} \ge \frac{1}{2} f^*$.
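To make Lemma 1 concrete, the following sketch searches for the shortest prefix that satisfies both conditions, given the expected weights. The function name `find_prefix` is hypothetical, and the search by direct enumeration is only an illustration; it is not the argument used in the proof. The code assumes $f(A, \bar{w}) \le f^*$.

```python
import math

def find_prefix(A, w_bar, f_star):
    """Smallest k such that the prefix B_k of A satisfies
    gap(B_k) >= gap(A) / 2 and p_{B_k} >= f_star / 2, or None if none does.

    A: tuple of item ids; w_bar: dict item -> expected weight;
    f_star: expected reward of the optimal solution.
    """
    gap_A = f_star - math.prod(w_bar[e] for e in A)
    prod_before = 1.0  # product over the first k-1 items, i.e. p_{B_k}
    for k, e in enumerate(A, start=1):
        prod_upto = prod_before * w_bar[e]  # f(B_k, w_bar)
        gap_B = f_star - prod_upto
        if gap_B >= 0.5 * gap_A and prod_before >= 0.5 * f_star:
            return k
        prod_before = prod_upto
    return None
```

For example, with expected weights `{3: 0.9, 4: 0.1}` and `f_star = 0.81`, the ordering `(4, 3)` already admits the one-item prefix, whereas `(3, 4)` requires both items.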

Then we count the number of times that the prefixes can be chosen instead of $A^*$ when all items in
the prefixes are observed. The last remaining issue is that $f(A, \mathbf{U}_t)$ is non-linear in the confidence
radii of the items in $A$. Therefore, we bound it from above based on the following lemma.

Lemma 2. Let $0 \le p_1, \dots, p_K \le 1$ and $u_1, \dots, u_K \ge 0$. Then:

$$\prod_{k=1}^K \min\{p_k + u_k, 1\} \le \prod_{k=1}^K p_k + \sum_{k=1}^K u_k.$$

This bound is tight when $p_1, \dots, p_K = 1$ and $u_1, \dots, u_K = 0$.
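A quick numerical sanity check of Lemma 2 can be written as follows. The helper name `lemma2_sides` is hypothetical; the snippet only evaluates both sides of the inequality for given inputs and does not replace the proof.

```python
import math
import random

def lemma2_sides(p, u):
    """Return (lhs, rhs) of Lemma 2 for sequences p in [0, 1] and u >= 0."""
    lhs = math.prod(min(pk + uk, 1.0) for pk, uk in zip(p, u))
    rhs = math.prod(p) + sum(u)
    return lhs, rhs

# Random spot checks with a fixed seed (illustrative only).
rng = random.Random(0)
for _ in range(1000):
    K = rng.randint(1, 6)
    p = [rng.random() for _ in range(K)]
    u = [rng.random() for _ in range(K)]
    lhs, rhs = lemma2_sides(p, u)
    assert lhs <= rhs + 1e-12
```

The tight case from the lemma, all `p` equal to one and all `u` equal to zero, gives equal sides.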


The rest of our analysis is along the lines of Theorem 5 in Kveton et al. [12]. We can achieve linear 
dependency on K, in exchange for a multiplicative factor of 534 in our upper bound. ■ 


We also prove the following gap-free bound. 


Theorem 2. The regret of CombCascade is bounded as:

$$R(n) \le 131 \sqrt{\frac{K L n \log n}{f^*}} + \frac{\pi^2}{3} L.$$


Proof. The proof is in Appendix B. The key idea is to decompose the regret of CombCascade into
two parts, where the gaps $\Delta_{\mathbf{A}_t}$ are at most $\epsilon$ and larger than $\epsilon$. We analyze each part separately and
then set $\epsilon$ to get the desired result. ■


3.2 Discussion 

In Section 3.1, we prove two upper bounds on the n-step regret of CombCascade:

Theorem 1: $O(KL(1/f^*)(1/\Delta) \log n)$, Theorem 2: $O(\sqrt{KL(1/f^*) n \log n})$,

where $\Delta = \min_{e \in \tilde{E}} \Delta_{e,\min}$. These bounds do not depend on the total number of feasible solutions
$|\Theta|$ and are polynomial in any other quantity of interest. The bounds match, up to $O(1/f^*)$ factors,
[Figure 1 plots: three panels titled $\bar{w} = (0.4, 0.4, 0.2, 0.2)$, $\bar{w} = (0.4, 0.4, 0.9, 0.1)$, and $\bar{w} = (0.4, 0.4, 0.3, 0.3)$; x-axis: Step n.]

Figure 1: The regret of CombCascade and CombUCB1 in the synthetic experiment (Section 4.1). The
results are averaged over 100 runs.


the upper bounds of CombUCB1 in stochastic combinatorial semi-bandits [12]. Since CombCascade
receives less feedback than CombUCB1, this is rather surprising. The upper bounds
of Kveton et al. [12] are known to be tight up to polylogarithmic factors. We believe that our upper
bounds are also tight in the setting similar to Kveton et al. [12], where the expected weight of each
item is close to 1 and likely to be observed.

The assumption that /* is large is often reasonable. In network routing, the optimal routing path is 
likely to be reliable. In recommender systems, the optimal recommended list often does not satisfy 
a reasonably large fraction of users. 

4 Experiments 

We evaluate CombCascade in three experiments. In Section 4.1, we compare it to CombUCB1 [12],
a state-of-the-art algorithm for stochastic combinatorial semi-bandits with a linear reward function.
This experiment shows that CombUCB1 cannot solve all instances of our problem, which highlights
the need for a new learning algorithm. It also shows the limitations of CombCascade. We evaluate
CombCascade on two real-world problems in Sections 4.2 and 4.3.

4.1 Synthetic 

In the first experiment, we compare CombCascade to CombUCB1 [12] on a synthetic problem. This
problem is a combinatorial cascading bandit with $L = 4$ items and $\Theta = \{(1, 2), (3, 4)\}$. CombUCB1
is a popular algorithm for stochastic combinatorial semi-bandits with a linear reward function. We
approximate $\max_{A \in \Theta} f(A, w)$ by $\min_{A \in \Theta} \sum_{e \in A} (1 - w(e))$. This approximation is motivated by
the fact that $f(A, w) = \prod_{e \in A} w(e) \approx 1 - \sum_{e \in A} (1 - w(e))$ as $\min_{e \in E} w(e) \to 1$. We update the
estimates of $\bar{w}$ in CombUCB1 as in CombCascade, based on the weights of the observed items in (1).

We experiment with three different settings of $\bar{w}$ and report our results in Figure 1. The settings of
$\bar{w}$ are reported in our plots. We assume that $\mathbf{w}_t(e)$ are distributed independently, except for the last
plot where $\mathbf{w}_t(3) = \mathbf{w}_t(4)$. Our plots represent three common scenarios that we encountered in our
experiments. In the first plot, $\arg\max_{A \in \Theta} f(A, \bar{w}) = \arg\min_{A \in \Theta} \sum_{e \in A} (1 - \bar{w}(e))$. In this case,
both CombCascade and CombUCB1 can learn $A^*$. The regret of CombCascade is slightly lower than
that of CombUCB1. In the second plot, $\arg\max_{A \in \Theta} f(A, \bar{w}) \ne \arg\min_{A \in \Theta} \sum_{e \in A} (1 - \bar{w}(e))$. In
this case, CombUCB1 cannot learn $A^*$ and therefore suffers linear regret. In the third plot, we violate
our modeling assumptions. Perhaps surprisingly, CombCascade can still learn the optimal solution
$A^*$, although it suffers higher regret than CombUCB1.
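The difference between the first two settings can be reproduced by direct computation on the expected weights. The sketch below uses hypothetical helper names (`best_by_product`, `best_by_sum`) and compares the maximizer of the product with the minimizer of the linear surrogate on the feasible set from this experiment.

```python
import math

FEASIBLE = [(1, 2), (3, 4)]

def best_by_product(w_bar):
    """Solution maximizing the product of expected weights (conjunctive objective)."""
    return max(FEASIBLE, key=lambda A: math.prod(w_bar[e] for e in A))

def best_by_sum(w_bar):
    """Solution minimizing the linear surrogate, the sum of 1 - w_bar(e)."""
    return min(FEASIBLE, key=lambda A: sum(1.0 - w_bar[e] for e in A))

first = {1: 0.4, 2: 0.4, 3: 0.2, 4: 0.2}
second = {1: 0.4, 2: 0.4, 3: 0.9, 4: 0.1}

# First setting: both objectives agree on (1, 2).
print(best_by_product(first), best_by_sum(first))
# Second setting: the product prefers (1, 2), the linear surrogate prefers (3, 4).
print(best_by_product(second), best_by_sum(second))
```

In the second setting the products are 0.16 and 0.09 while the surrogate sums are 1.2 and 1.0, so an algorithm that optimizes the surrogate converges to the wrong solution.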

4.2 Network Routing 

In the second experiment, we evaluate CombCascade on a problem of network routing. We experiment
with six networks from the RocketFuel dataset [17], which are described in Figure 2a.

Our learning problem is formulated as follows. The ground set $E$ are the links in the network. The
feasible set $\Theta$ are all paths in the network. At time $t$, we generate a random pair of starting and end
nodes, and the learning agent chooses a routing path between these nodes. The goal of the agent is
to maximize the probability that all links in the path are up. The feedback is the index of the first
link in the path which is down. The weight of link $e$ at time $t$, $\mathbf{w}_t(e)$, is an indicator of link $e$ being









Network   Nodes   Links
1221      108     153
1239      315     972
1755       87     161
3257      161     328
3967       79     147
6461      141     374

(a)

[Regret plots; x-axis: Step n.]

(b)


Figure 2: a. The description of six networks from our network routing experiment (Section 4.2). b. 
The n-step regret of CombCascade in these networks. The results are averaged over 50 runs. 


up at time $t$. We model $\mathbf{w}_t(e)$ as an independent Bernoulli random variable $\mathbf{w}_t(e) \sim B(\bar{w}(e))$ with
mean $\bar{w}(e) = 0.7 + 0.2 \, \mathrm{local}(e)$, where $\mathrm{local}(e)$ is an indicator of link $e$ being local. We say that
the link is local when its expected latency is at most 1 millisecond. About half of the links in our
networks are local. To summarize, the local links are up with probability 0.9 and are more reliable
than the global links, which are up only with probability 0.7.
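Because maximizing a product of UCBs is equivalent to minimizing the sum of their negative logarithms (Section 2.2), the path with the largest product can be found with a shortest-path routine. The sketch below is one way to do this with Dijkstra's algorithm; it is an illustration under assumptions, not the code used in the experiment. The function name `most_reliable_path`, the adjacency-list format, and the clamping of UCBs away from zero are all hypothetical choices.

```python
import heapq
import math

def most_reliable_path(links, ucb, source, target):
    """Return the list of link ids on the path maximizing the product of UCBs.

    links: dict node -> list of (neighbor, link_id) pairs (directed adjacency).
    ucb:   dict link_id -> value in [0, 1].
    Returns None if the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, link in links.get(u, []):
            # Cost -log(UCB) is nonnegative because UCB <= 1; clamp to avoid log(0).
            nd = d - math.log(max(ucb[link], 1e-12))
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = (u, link)
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = []
    node = target
    while node != source:
        node, link = prev[node]
        path.append(link)
    return path[::-1]
```

For example, a two-hop route over links with UCBs 0.9 and 0.9 (product 0.81) is preferred to a direct link with UCB 0.7.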

Our results are reported in Figure 2b. We observe that the n-step regret of CombCascade flattens as 
time n increases. This means that CombCascade learns near-optimal policies in all networks. 

4.3 Diverse Recommendations 

In our last experiment, we evaluate CombCascade on a problem of diverse recommendations. This 
problem is motivated by on-demand media streaming services like Netflix, which often recommend 
groups of movies, such as “Popular on Netflix” and “Dramas”. We experiment with the MovieLens 
dataset [13] from March 2015. The dataset contains 138k people who assigned 20M ratings to 27k 
movies between January 1995 and March 2015. 

Our learning problem is formulated as follows. The ground set $E$ are 200 movies from our dataset:
25 most rated animated movies, 75 random animated movies, 25 most rated non-animated movies,
and 75 random non-animated movies. The feasible set $\Theta$ are all $K$-permutations of $E$ where $K/2$
movies are animated. The weight of item $e$ at time $t$, $\mathbf{w}_t(e)$, indicates that item $e$ attracts the user at
time $t$. We assume that $\mathbf{w}_t(e) = 1$ if and only if the user rated item $e$ in our dataset. This indicates
that the user watched movie $e$ at some point in time, perhaps because the movie was attractive. The
user at time $t$ is drawn randomly from our pool of users. The goal of the learning agent is to learn a
list of items $A^* = \arg\max_{A \in \Theta} \mathbb{E}[f_\vee(A, \mathbf{w})]$ that maximizes the probability that at least one item
is attractive. The feedback is the index of the first attractive item in the list (Section 2.3). We would
like to point out that our modeling assumptions are violated in this experiment. In particular, $\mathbf{w}_t(e)$
are correlated across items $e$ because the users do not rate movies independently. The result is that
$A^* \ne \arg\max_{A \in \Theta} f_\vee(A, \bar{w})$. It is NP-hard to compute $A^*$. However, $\mathbb{E}[f_\vee(A, \mathbf{w})]$ is submodular
and monotone in $A$, and therefore a $(1 - 1/e)$ approximation to $A^*$ can be computed greedily. We
denote this approximation by $A^*$ and show it for $K = 8$ in Figure 3a.

Our results are reported in Figure 3b. Similarly to Figure 2b, the n-step regret of CombCascade is 
a concave function of time n for all studied K. This indicates that CombCascade solutions improve 
over time. We note that the regret does not flatten as in Figure 2b. The reason is that CombCascade 
does not learn A*. Nevertheless, it performs well and we expect comparably good performance in 
other domains where our modeling assumptions are not satisfied. Our current theory cannot explain 
this behavior and we leave it for future work. 

5 Related Work 

Our work generalizes cascading bandits of Kveton et al. [10] to arbitrary combinatorial constraints. 
The feasible set in cascading bandits is a uniform matroid, any list of K items out of L is feasible. 
Our generalization significantly expands the applicability of the original model and we demonstrate 
this on two novel real-world problems (Section 4). Our work also extends stochastic combinatorial 
semi-bandits with a linear reward function [8, 11, 12] to the cascade model of feedback. A similar
model to cascading bandits was recently studied by Combes et al. [7]. 
Movie title                  Animation
Pulp Fiction                 No
Forrest Gump                 No
Independence Day             No
Shawshank Redemption         No
Toy Story                    Yes
Shrek                        Yes
Who Framed Roger Rabbit?     Yes
Aladdin                      Yes

(a)

[Regret plot.]

(b)

Figure 3: a. The optimal list of 8 movies in the diverse recommendations experiment (Section 4.3).
b. The n-step regret of CombCascade in this experiment. The results are averaged over 50 runs.


Our generalization is significant for two reasons. First, CombCascade is a novel learning algorithm.
CombUCB1 [12] chooses solutions with the largest sum of the UCBs. CascadeUCB1 [10] chooses K
items out of L with the largest UCBs. CombCascade chooses solutions with the largest product of
the UCBs. All three algorithms can find the optimal solution in cascading bandits. However, when
the feasible set is not a matroid, it is critical to maximize the product of the UCBs. CombUCB1 may
learn a suboptimal solution in this setting and we illustrate this in Section 4.1.

Second, our analysis is novel. The proof of Theorem 1 is different from those of Theorems 2 and 3 
in Kveton et al. [10]. These proofs are based on counting the number of times that each suboptimal 
item is chosen instead of any optimal item. They can only be applied to special feasible sets, such as
matroids, because they require that the items in the feasible solutions are exchangeable. We build on
the recent work of Kveton et al. [12] to achieve linear dependency on K in Theorem 1. The rest of 
our analysis is novel. 

Our problem is a partial monitoring problem where some of the chosen items may be unobserved. 
Agrawal et al. [1] and Bartok et al. [4] studied partial monitoring problems and proposed learning 
algorithms for solving them. These algorithms are impractical in our setting. As an example, if we 
formulate our problem as in Bartok et al. [4], we get $|\Theta|$ actions and $2^L$ unobserved outcomes; and
the learning algorithm reasons over $|\Theta|^2$ pairs of actions and requires $O(2^L)$ space. Lin et al. [15]
also studied combinatorial partial monitoring. Their feedback is a linear function of the weights of 
chosen items. Our feedback is a non-linear function of the weights. 

Our reward function is non-linear in unknown parameters. Chen et al. [5] studied stochastic combinatorial
semi-bandits with a non-linear reward function, which is a known monotone function of an
unknown linear function. The feedback in Chen et al. [5] is semi-bandit, which is more informative 
than in our work. Le et al. [14] studied a network optimization problem where the reward function 
is a non-linear function of observations. 


6 Conclusions 

We propose combinatorial cascading bandits, a class of stochastic partial monitoring problems that 
can model many practical problems, such as learning of a routing path in an unreliable communication
network that maximizes the probability of packet delivery, and learning to recommend a list of
attractive items. We propose a practical UCB-like algorithm for our problems, CombCascade, and 
prove upper bounds on its regret. We evaluate CombCascade on two real-world problems and show 
that it performs well even when our modeling assumptions are violated. 

Our results and analysis apply to any combinatorial action set, and therefore are quite general. The 
strongest assumption in our work is that the weights of items are distributed independently of each 
other. This assumption is critical and hard to eliminate (Section 2.1). Nevertheless, it can be easily 
relaxed to conditional independence given the features of items, along the lines of Wen et al. [19]. 
We leave this for future work. From the theoretical point of view, we want to derive a lower bound 
on the n-step regret in combinatorial cascading bandits, and show that the factor of f* in Theorems 
1 and 2 is intrinsic. 
A Proof of Theorem 1


Our proof has four main parts. In Appendix A.1, we bound the regret associated with the event that
our high-probability confidence intervals do not hold. In Appendix A.2, we change counted events, 
from partially-observed suboptimal solutions to their fully-observed prefixes. In Appendix A.3, we 
bound the number of times that any suboptimal prefix can be chosen instead of the optimal solution 
A*. In Appendix A.4, we apply the counting argument of Kveton et al. [12] and finish our proof. 

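Let $\mathbf{R}_t = R(\mathbf{A}_t, \mathbf{w}_t)$ be the stochastic regret of CombCascade at time $t$, where $\mathbf{A}_t$ and $\mathbf{w}_t$ are the
solution and the weights of the items at time $t$, respectively. Let:

$$\mathcal{E}_t = \{\exists e \in E \text{ s.t. } |\bar{w}(e) - \hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e)| \ge c_{t-1, \mathbf{T}_{t-1}(e)}\}$$

be the event that $\bar{w}(e)$ is outside of the high-probability confidence interval around $\hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e)$ for
at least one item $e \in E$ at time $t$; and let $\bar{\mathcal{E}}_t$ be the complement of event $\mathcal{E}_t$, the event that $\bar{w}(e)$ is in
the high-probability confidence interval around $\hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e)$ for all items $e \in E$ at time $t$. Then we
can decompose the expected regret of CombCascade as:

$$R(n) = \mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\mathcal{E}_t\} \mathbf{R}_t\right] + \mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\bar{\mathcal{E}}_t\} \mathbf{R}_t\right]. \tag{4}$$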


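A.1 Confidence Intervals Fail

The first term in (4) is easy to bound because $\mathbf{R}_t$ is bounded and our confidence intervals hold with
high probability. In particular, Hoeffding’s inequality yields that for any $e$, $s$, and $t$:

$$P(|\bar{w}(e) - \hat{\mathbf{w}}_s(e)| \ge c_{t,s}) \le 2 \exp[-3 \log t],$$

and therefore:

$$\mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\mathcal{E}_t\}\right] \le \sum_{e \in E} \sum_{t=1}^n \sum_{s=1}^t P(|\bar{w}(e) - \hat{\mathbf{w}}_s(e)| \ge c_{t,s}) \le 2 \sum_{e \in E} \sum_{t=1}^n \sum_{s=1}^t \exp[-3 \log t] \le 2 \sum_{e \in E} \sum_{t=1}^n t^{-2} \le \frac{\pi^2}{3} L.$$

Since $\mathbf{R}_t \le 1$, $\mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\mathcal{E}_t\} \mathbf{R}_t\right] \le \frac{\pi^2}{3} L$.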
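The two numerical facts used in this step can be checked directly: the Hoeffding tail with radius $c_{t,s}$ equals $2 t^{-3}$ for every $s$, and the resulting per-item sum stays below $\pi^2 / 3$. The helper names below are hypothetical and the snippet is only a sanity check of the arithmetic.

```python
import math

def hoeffding_tail(s, t):
    """Two-sided Hoeffding bound 2 exp(-2 s c^2) with c = sqrt(1.5 log(t) / s)."""
    c = math.sqrt(1.5 * math.log(t) / s)
    return 2.0 * math.exp(-2.0 * s * c * c)

def per_item_failure_bound(n):
    """Sum over t = 1..n and s = 1..t of 2 exp(-3 log t), which equals 2 * sum of t^-2."""
    total = 0.0
    for t in range(1, n + 1):
        tail = 2.0 * math.exp(-3.0 * math.log(t))  # same value for every s
        total += t * tail
    return total
```

Multiplying the per-item bound by the number of items $L$ gives the $\frac{\pi^2}{3} L$ term.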


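A.2 From Partially-Observed Solutions to Fully-Observed Prefixes

Let $\mathcal{H}_t = (\mathbf{A}_1, \mathbf{O}_1, \dots, \mathbf{A}_{t-1}, \mathbf{O}_{t-1}, \mathbf{A}_t)$ be the history of CombCascade up to choosing solution
$\mathbf{A}_t$, the first $t - 1$ observations and $t$ actions. Let $\mathbb{E}[\cdot \mid \mathcal{H}_t]$ be the conditional expectation given this
history. Then we can rewrite the expected regret at time $t$ conditioned on $\mathcal{H}_t$ as:

$$\mathbb{E}[\mathbf{R}_t \mid \mathcal{H}_t] = \mathbb{E}[\Delta_{\mathbf{A}_t} \mathbb{1}\{\Delta_{\mathbf{A}_t} > 0\} \mid \mathcal{H}_t] = \mathbb{E}\left[\frac{\Delta_{\mathbf{A}_t}}{p_{\mathbf{A}_t}} \mathbb{1}\{\Delta_{\mathbf{A}_t} > 0, \ \mathbf{O}_t \ge |\mathbf{A}_t|\} \,\middle|\, \mathcal{H}_t\right]$$

and analyze our problem under the assumption that all items in $\mathbf{A}_t$ are observed. This reduction is
problematic because the probability $p_{\mathbf{A}_t}$ can be low, and as a result we get a loose regret bound. To
address this problem, we introduce the notion of prefixes.

Let $A = (a_1, \dots, a_{|A|})$. Then $B = (a_1, \dots, a_k)$ is a prefix of $A$ for any $k \le |A|$. In the rest of our
analysis, we treat prefixes as feasible solutions to our original problem. Let $\mathbf{B}_t$ be a prefix of $\mathbf{A}_t$ as
defined in Lemma 1. Then $\Delta_{\mathbf{B}_t} \ge \frac{1}{2} \Delta_{\mathbf{A}_t}$ and $p_{\mathbf{B}_t} \ge \frac{1}{2} f^*$, and we can bound the expected regret at
time $t$ conditioned on $\mathcal{H}_t$ as:

$$\mathbb{E}[\mathbf{R}_t \mid \mathcal{H}_t] = \mathbb{E}\left[\frac{\Delta_{\mathbf{A}_t}}{p_{\mathbf{B}_t}} \mathbb{1}\{\Delta_{\mathbf{A}_t} > 0, \ \mathbf{O}_t \ge |\mathbf{B}_t|\} \,\middle|\, \mathcal{H}_t\right] \le \frac{4}{f^*} \mathbb{E}[\Delta_{\mathbf{B}_t} \mathbb{1}\{\Delta_{\mathbf{B}_t} > 0, \ \mathbf{O}_t \ge |\mathbf{B}_t|\} \mid \mathcal{H}_t]. \tag{5}$$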
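Now we bound the second term in (4):

$$\mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\bar{\mathcal{E}}_t\} \mathbf{R}_t\right] \stackrel{(a)}{=} \sum_{t=1}^n \mathbb{E}[\mathbb{1}\{\bar{\mathcal{E}}_t\} \mathbb{E}[\mathbf{R}_t \mid \mathcal{H}_t]] \stackrel{(b)}{\le} \frac{4}{f^*} \mathbb{E}\left[\sum_{t=1}^n \Delta_{\mathbf{B}_t} \mathbb{1}\{\bar{\mathcal{E}}_t, \ \Delta_{\mathbf{B}_t} > 0, \ \mathbf{O}_t \ge |\mathbf{B}_t|\}\right]. \tag{6}$$

Equality (a) is due to the tower rule and that $\mathbb{1}\{\bar{\mathcal{E}}_t\}$ is only a function of $\mathcal{H}_t$. Inequality (b) follows
from the upper bound in (5).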

A.3 Counting Suboptimal Prefixes 

Let: 


“ 1 ^ E '^".Tt-i(e) > Abj > 0, Ot > |Bt 


(7) 


eGBt 


be the event that suboptimal prefix Bt is “hard to distinguish” from A*, where Bj = Bj \ A* is the 
set of suboptimal items in Bt. The goal of this section is to bound (6) by a function of 

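We bound $\Delta_{B_t} \mathbb{1}\{\overline{\mathcal{E}}_t, \, \Delta_{B_t} > 0, \, O_t \geq |B_t|\}$
from above for any suboptimal prefix $B_t$. Our bound is proved based on several facts. First, $B_t$
is a prefix of $A_t$, and hence $f(B_t, U_t) \geq f(A_t, U_t)$ for any $U_t$. Second, when
CombCascade chooses $A_t$, $f(A_t, U_t) \geq f(A^*, U_t)$. It follows that:

\[
\prod_{e \in B_t} U_t(e) = f(B_t, U_t) \geq f(A_t, U_t) \geq f(A^*, U_t) = \prod_{e \in A^*} U_t(e) \,.
\]

Now we divide both sides by $\prod_{e \in A^* \cap B_t} U_t(e)$:

\[
\prod_{e \in \tilde{B}_t} U_t(e) \geq \prod_{e \in A^* \setminus B_t} U_t(e)
\]

and substitute the definitions of the UCBs from (3):

\[
\prod_{e \in \tilde{B}_t} \min\{\hat{w}_{T_{t-1}(e)}(e) + c_{t-1, T_{t-1}(e)}, 1\}
\geq \prod_{e \in A^* \setminus B_t} \min\{\hat{w}_{T_{t-1}(e)}(e) + c_{t-1, T_{t-1}(e)}, 1\} \,.
\]

Since $\overline{\mathcal{E}}_t$ happens, $|\bar{w}(e) - \hat{w}_{T_{t-1}(e)}(e)| < c_{t-1, T_{t-1}(e)}$
for all $e \in E$, and therefore:

\begin{align*}
\prod_{e \in A^* \setminus B_t} \min\{\hat{w}_{T_{t-1}(e)}(e) + c_{t-1, T_{t-1}(e)}, 1\}
& \geq \prod_{e \in A^* \setminus B_t} \bar{w}(e) \,, \\
\prod_{e \in \tilde{B}_t} \min\{\hat{w}_{T_{t-1}(e)}(e) + c_{t-1, T_{t-1}(e)}, 1\}
& \leq \prod_{e \in \tilde{B}_t} \min\{\bar{w}(e) + 2 c_{t-1, T_{t-1}(e)}, 1\} \,.
\end{align*}

By Lemma 2:

\[
\prod_{e \in \tilde{B}_t} \min\{\bar{w}(e) + 2 c_{t-1, T_{t-1}(e)}, 1\}
\leq \prod_{e \in \tilde{B}_t} \bar{w}(e) + 2 \sum_{e \in \tilde{B}_t} c_{t-1, T_{t-1}(e)} \,.
\]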

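Finally, we chain the last four inequalities and get:

\[
\prod_{e \in \tilde{B}_t} \bar{w}(e) + 2 \sum_{e \in \tilde{B}_t} c_{t-1, T_{t-1}(e)}
\geq \prod_{e \in A^* \setminus B_t} \bar{w}(e) \,,
\]

which further implies that:

\begin{align*}
2 \sum_{e \in \tilde{B}_t} c_{t-1, T_{t-1}(e)}
& \geq \prod_{e \in A^* \setminus B_t} \bar{w}(e) - \prod_{e \in \tilde{B}_t} \bar{w}(e) \\
& \geq \underbrace{\prod_{e \in A^* \cap B_t} \bar{w}(e)}_{\leq 1}
\left[\prod_{e \in A^* \setminus B_t} \bar{w}(e) - \prod_{e \in \tilde{B}_t} \bar{w}(e)\right] \\
& = \Delta_{B_t} \,.
\end{align*}

Since $c_{n, T_{t-1}(e)} \geq c_{t-1, T_{t-1}(e)}$ for any time $t \leq n$, the event $\mathcal{F}_t$
in (7) happens. Therefore, we can bound the right-hand side in (6) as:

\[
\mathbb{E}\left[\sum_{t=1}^n \Delta_{B_t} \mathbb{1}\{\overline{\mathcal{E}}_t, \, \Delta_{B_t} > 0, \, O_t \geq |B_t|\}\right]
\leq \mathbb{E}[\hat{R}(n)] \,,
\]

where:

\[
\hat{R}(n) = \sum_{t=1}^n \Delta_{B_t} \mathbb{1}\{\mathcal{F}_t\} \,. \tag{8}
\]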
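The last step of the derivation above factors the items shared by $A^*$ and $B_t$ out of the gap $\Delta_{B_t} = f^* - f(B_t, \bar{w})$. The following sketch checks this factorization numerically on a small instance. The expected weights and the two item sets are hypothetical, and the identity itself does not require the first set to be optimal.

```python
from math import prod


def gap_two_ways(w_bar, optimal, prefix):
    """Compute the gap of a prefix directly and via the factorization over shared items."""
    a_star, b = set(optimal), set(prefix)

    def f(items):
        # Product of expected weights; the empty product is 1.
        return prod(w_bar[e] for e in items)

    direct = f(a_star) - f(b)
    factored = f(a_star & b) * (f(a_star - b) - f(b - a_star))
    return direct, factored


# Hypothetical expected weights for five items.
w_bar = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.5, 5: 0.4}
direct, factored = gap_two_ways(w_bar, optimal=(1, 2, 3), prefix=(1, 4, 5))
```

Here both expressions evaluate to $0.9 \cdot (0.56 - 0.2) = 0.324$.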


A.4 CombUCB1 Analysis of Kveton et al. [12]


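It remains to bound $\hat{R}(n)$ in (8). Note that the event $\mathcal{F}_t$ can happen only if the
weights of all items in $B_t$ are observed. As a result, $\hat{R}(n)$ can be bounded as in stochastic
combinatorial semi-bandits. The key idea of our proof is to introduce infinitely-many
mutually-exclusive events and then bound the number of times that these events happen when a
suboptimal prefix is chosen [12]. The event $i$ at time $t$ is:

\begin{align*}
G_{i,t} = \Big\{
& \text{less than } \beta_1 K \text{ items in } \tilde{B}_t \text{ were observed at most }
\alpha_1 \frac{K^2}{\Delta_{B_t}^2} \log n \text{ times}, \\
& \dots, \\
& \text{less than } \beta_{i-1} K \text{ items in } \tilde{B}_t \text{ were observed at most }
\alpha_{i-1} \frac{K^2}{\Delta_{B_t}^2} \log n \text{ times}, \\
& \text{at least } \beta_i K \text{ items in } \tilde{B}_t \text{ were observed at most }
\alpha_i \frac{K^2}{\Delta_{B_t}^2} \log n \text{ times}, \\
& O_t \geq |B_t| \Big\} \,,
\end{align*}

where we assume that $\Delta_{B_t} > 0$; and the constants $(\alpha_i)$ and $(\beta_i)$ are defined as:

\begin{align*}
1 = \beta_0 > \beta_1 > \beta_2 > \dots > \beta_k > \dots \\
\alpha_1 > \alpha_2 > \dots > \alpha_k > \dots \,,
\end{align*}

and satisfy $\lim_{i \to \infty} \alpha_i = \lim_{i \to \infty} \beta_i = 0$. By Lemma 3 of Kveton et
al. [12], $G_{i,t}$ are exhaustive at any time $t$ when $(\alpha_i)$ and $(\beta_i)$ satisfy:

\[
\sqrt{6} \sum_{i=1}^\infty \frac{\beta_{i-1} - \beta_i}{\sqrt{\alpha_i}} \leq 1 \,.
\]

In this case:

\[
\hat{R}(n) = \sum_{t=1}^n \Delta_{B_t} \mathbb{1}\{\mathcal{F}_t\}
= \sum_{i=1}^\infty \sum_{t=1}^n \Delta_{B_t} \mathbb{1}\{G_{i,t}, \, \Delta_{B_t} > 0\} \,.
\]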


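Now we introduce item-specific variants of events $G_{i,t}$ and associate the regret at time $t$ with
these events. In particular, let:

\[
G_{e,i,t} = G_{i,t} \cap \left\{ e \in \tilde{B}_t, \, T_{t-1}(e) \leq \alpha_i \frac{K^2}{\Delta_{B_t}^2} \log n \right\}
\]

be the event that item $e$ is not observed “sufficiently often” under event $G_{i,t}$. Then it follows
that:

\[
\mathbb{1}\{G_{i,t}, \, \Delta_{B_t} > 0\}
\leq \frac{1}{\beta_i K} \sum_{e \in \tilde{E}} \mathbb{1}\{G_{e,i,t}, \, \Delta_{B_t} > 0\}
\]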

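because at least $\beta_i K$ items are not observed “sufficiently often” under event $G_{i,t}$.
Therefore, we can bound $\hat{R}(n)$ as:

\[
\hat{R}(n) \leq \sum_{e \in \tilde{E}} \sum_{i=1}^\infty \sum_{t=1}^n
\mathbb{1}\{G_{e,i,t}, \, \Delta_{B_t} > 0\} \frac{\Delta_{B_t}}{\beta_i K} \,.
\]

Let each item $e$ be in $N_e$ suboptimal prefixes and $\Delta_{e,1} \geq \dots \geq \Delta_{e,N_e}$ be
the gaps of these prefixes, ordered from the largest gap to the smallest. Then $\hat{R}(n)$ can be
further bounded as:

\begin{align*}
\hat{R}(n)
& \leq \sum_{e \in \tilde{E}} \sum_{i=1}^\infty \sum_{t=1}^n \sum_{k=1}^{N_e}
\mathbb{1}\{G_{e,i,t}, \, \Delta_{B_t} = \Delta_{e,k}\} \frac{\Delta_{e,k}}{\beta_i K} \\
& \stackrel{(a)}{\leq} \sum_{e \in \tilde{E}} \sum_{i=1}^\infty \sum_{t=1}^n \sum_{k=1}^{N_e}
\mathbb{1}\left\{e \in \tilde{B}_t, \, T_{t-1}(e) \leq \alpha_i \frac{K^2}{\Delta_{e,k}^2} \log n, \,
\Delta_{B_t} = \Delta_{e,k}, \, O_t \geq |B_t|\right\} \frac{\Delta_{e,k}}{\beta_i K} \\
& \stackrel{(b)}{\leq} \sum_{e \in \tilde{E}} \sum_{i=1}^\infty \frac{\alpha_i K \log n}{\beta_i}
\left[\Delta_{e,1} \frac{1}{\Delta_{e,1}^2} + \sum_{k=2}^{N_e} \Delta_{e,k}
\left(\frac{1}{\Delta_{e,k}^2} - \frac{1}{\Delta_{e,k-1}^2}\right)\right] \\
& \stackrel{(c)}{<} \sum_{e \in \tilde{E}} \sum_{i=1}^\infty \frac{\alpha_i K \log n}{\beta_i}
\frac{2}{\Delta_{e,N_e}} \\
& = \sum_{e \in \tilde{E}} K \frac{2}{\Delta_{e,N_e}}
\left[\sum_{i=1}^\infty \frac{\alpha_i}{\beta_i}\right] \log n \,,
\end{align*}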


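where inequality (a) follows from the definition of $G_{e,i,t}$ and inequality (b) is from solving:

\[
\max_{A_{1:n}, O_{1:n}} \sum_{t=1}^n \sum_{k=1}^{N_e}
\mathbb{1}\left\{e \in \tilde{B}_t, \, T_{t-1}^{A_{1:n}, O_{1:n}}(e) \leq \alpha_i \frac{K^2}{\Delta_{e,k}^2} \log n, \,
\Delta_{B_t} = \Delta_{e,k}, \, O_t \geq |B_t|\right\} \frac{\Delta_{e,k}}{\beta_i K} \,,
\]

where $A_{1:n} = (A_1, \dots, A_n)$ is a sequence of $n$ solutions, $O_{1:n} = (O_1, \dots, O_n)$ is a
sequence of $n$ observations, $T_t^{A_{1:n}, O_{1:n}}(e)$ is the number of times that item $e$ is
observed in $t$ steps under $A_{1:n}$ and $O_{1:n}$, $B_t$ is the prefix of $A_t$ as defined in Lemma
1, and $\tilde{B}_t = B_t \setminus A^*$. Inequality (c) is by Lemma 3 of Kveton et al. [11]:

\[
\left[\Delta_{e,1} \frac{1}{\Delta_{e,1}^2} + \sum_{k=2}^{N_e} \Delta_{e,k}
\left(\frac{1}{\Delta_{e,k}^2} - \frac{1}{\Delta_{e,k-1}^2}\right)\right] < \frac{2}{\Delta_{e,N_e}} \,.
\]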
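The inequality used in step (c) can be spot-checked numerically. The sketch below evaluates its left-hand side for randomly drawn gaps, sorted from largest to smallest, and compares it with $2 / \Delta_{e,N_e}$. The function names, the range of the gaps, and the number of trials are arbitrary illustrative choices, and a random check is of course not a substitute for the proof in [11].

```python
import random


def weighted_gap_sum(gaps):
    """Left-hand side of the inequality for gaps sorted from largest to smallest."""
    total = gaps[0] * (1.0 / gaps[0] ** 2)
    for prev, cur in zip(gaps, gaps[1:]):
        total += cur * (1.0 / cur ** 2 - 1.0 / prev ** 2)
    return total


def check_random_gaps(trials=1000, seed=0):
    """Return True if the bound 2 / (smallest gap) holds on all random trials."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 10)
        gaps = sorted((rng.uniform(0.01, 1.0) for _ in range(n)), reverse=True)
        if not weighted_gap_sum(gaps) < 2.0 / gaps[-1]:
            return False
    return True
```

For example, for gaps $(0.5, 0.25)$ the left-hand side is $2 + 0.25 \cdot (16 - 4) = 5$, which is below $2 / 0.25 = 8$.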


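For the same $(\alpha_i)$ and $(\beta_i)$ as in Theorem 4 of Kveton et al. [12],
$\sum_{i=1}^\infty \frac{\alpha_i}{\beta_i} < 267$. Moreover, since
$\Delta_{B_t} \geq \frac{1}{2} \Delta_{A_t}$ for any solution $A_t$ and its prefix $B_t$, we have
$\Delta_{e,N_e} \geq \frac{1}{2} \Delta_{e,\min}$. Now we chain all inequalities and get:

\[
R(n) \leq \frac{4}{f^*} \mathbb{E}[\hat{R}(n)] + \frac{\pi^2}{3} L
\leq \frac{K}{f^*} \sum_{e \in \tilde{E}} \frac{4272}{\Delta_{e,\min}} \log n + \frac{\pi^2}{3} L \,.
\]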
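The constants in this bound can be sanity-checked with a short sketch. It assumes geometric sequences $\beta_i = \beta^i$ and $\alpha_i = d \alpha^i$, with $d$ set so that the condition $\sqrt{6} \sum_{i} (\beta_{i-1} - \beta_i) / \sqrt{\alpha_i} \leq 1$ holds with equality. Both the geometric form and the values $\alpha = 0.1459$ and $\beta = 0.2360$ are assumptions made here for illustration; the text above only states that the sum of $\alpha_i / \beta_i$ is below 267.

```python
from math import sqrt


def geometric_constants(alpha, beta, terms=200):
    """Evaluate the two series for beta_i = beta**i and alpha_i = d * alpha**i.

    The scale d is chosen so that the exhaustiveness condition holds with equality.
    Requires 0 < alpha < beta < sqrt(alpha) so that both series converge.
    """
    d = 6.0 * ((1.0 - beta) / (sqrt(alpha) - beta)) ** 2
    # Truncated version of sqrt(6) * sum_i (beta_{i-1} - beta_i) / sqrt(alpha_i).
    condition = sqrt(6.0) * sum(
        (beta ** (i - 1) - beta ** i) / sqrt(d * alpha ** i)
        for i in range(1, terms + 1)
    )
    # Truncated version of sum_i alpha_i / beta_i.
    ratio_sum = sum(d * alpha ** i / beta ** i for i in range(1, terms + 1))
    return condition, ratio_sum


# Illustrative parameter values (an assumption, not quoted from the source).
condition, ratio_sum = geometric_constants(alpha=0.1459, beta=0.2360)

# 4 from the factor 4 / f*, 2 from step (c), 2 from the prefix gaps, 267 from the series.
leading_constant = 4 * 2 * 2 * 267
```

Under these assumptions the series of ratios evaluates to roughly 266, and the product of the four factors is 4272, the leading constant in the bound.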



B Proof of Theorem 2 


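The key idea is to decompose the regret of CombCascade into two parts, where the gaps
$\Delta_{A_t}$ are at most $\varepsilon$ and larger than $\varepsilon$. In particular, note that for
any $\varepsilon > 0$:

\[
R(n) = \mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\Delta_{A_t} \leq \varepsilon\} R_t\right]
+ \mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\Delta_{A_t} > \varepsilon\} R_t\right] . \tag{9}
\]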


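The first term in (9) can be bounded trivially as:

\[
\mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\Delta_{A_t} \leq \varepsilon\} R_t\right]
= \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t} \mathbb{1}\{\Delta_{A_t} \leq \varepsilon, \, \Delta_{A_t} > 0\}\right]
\leq \varepsilon n
\]

because $\Delta_{A_t} \leq \varepsilon$. The second term in (9) can be bounded in the same way as
$R(n)$ in Theorem 1. The only difference is that $\Delta_{e,\min} \geq \varepsilon$ for all
$e \in \tilde{E}$. Therefore:

\[
\mathbb{E}\left[\sum_{t=1}^n \mathbb{1}\{\Delta_{A_t} > \varepsilon\} R_t\right]
\leq \frac{K}{f^*} \sum_{e \in \tilde{E}} \frac{4272}{\Delta_{e,\min}} \log n + \frac{\pi^2}{3} L
\leq \frac{4272 K L}{f^* \varepsilon} \log n + \frac{\pi^2}{3} L \,.
\]

Now we chain all inequalities and get:

\[
R(n) \leq \frac{4272 K L}{f^* \varepsilon} \log n + \varepsilon n + \frac{\pi^2}{3} L \,.
\]

Finally, we choose $\varepsilon = \sqrt{\frac{4272 K L \log n}{f^* n}}$ and get:

\[
R(n) \leq 2 \sqrt{\frac{4272 K L n \log n}{f^*}} + \frac{\pi^2}{3} L
< 131 \sqrt{\frac{K L n \log n}{f^*}} + \frac{\pi^2}{3} L \,,
\]

which concludes our proof.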
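The choice of $\varepsilon$ balances the term proportional to $1 / \varepsilon$ against the term $\varepsilon n$. The sketch below evaluates the bound at the balanced threshold and compares it with the closed form. The problem sizes $K$, $L$, $n$, and $f^*$ are hypothetical values chosen only for illustration.

```python
from math import log, pi, sqrt


def gap_free_bound(K, L, n, f_star, eps):
    """Right-hand side of the regret bound for a given threshold eps."""
    return 4272.0 * K * L * log(n) / (f_star * eps) + eps * n + pi ** 2 / 3.0 * L


def balanced_eps(K, L, n, f_star):
    """Threshold that equalizes the 1 / eps term and the eps * n term."""
    return sqrt(4272.0 * K * L * log(n) / (f_star * n))


# Hypothetical problem sizes, chosen only for illustration.
K, L, n, f_star = 4, 20, 10 ** 9, 0.5
eps = balanced_eps(K, L, n, f_star)
bound = gap_free_bound(K, L, n, f_star, eps)
closed_form = 2.0 * sqrt(4272.0 * K * L * n * log(n) / f_star) + pi ** 2 / 3.0 * L
```

The constant 131 arises because $2 \sqrt{4272} \approx 130.7$.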


C Technical Lemmas 

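Lemma 1. Let $A = (a_1, \dots, a_{|A|}) \in \Theta$ be a feasible solution and
$B_k = (a_1, \dots, a_k)$ be a prefix of $k \leq |A|$ items of $A$. Then $k$ can be set such that
$\Delta_{B_k} \geq \frac{1}{2} \Delta_A$ and $p_{B_k} \geq \frac{1}{2} f^*$.

Proof. We consider two cases. First, suppose that $f(A, \bar{w}) \geq \frac{1}{2} f^*$. Then our
claims hold trivially for $k = |A|$. Now suppose that $f(A, \bar{w}) < \frac{1}{2} f^*$. Then we
choose $k$ such that:

\[
f(B_k, \bar{w}) \leq \frac{1}{2} f^* \leq p_{B_k} \,.
\]

Such $k$ is guaranteed to exist because
$\bigcup_{i=1}^{|A|} [f(B_i, \bar{w}), p_{B_i}] = [f(A, \bar{w}), 1]$, which follows from the facts
that $f(B_i, \bar{w}) = p_{B_i} \bar{w}(a_i)$ for any $i \leq |A|$ and $p_{B_1} = 1$. We prove that
$\Delta_{B_k} \geq \frac{1}{2} \Delta_A$ as:

\[
\Delta_{B_k} = f^* - f(B_k, \bar{w}) \geq \frac{1}{2} f^* \geq \frac{1}{2} \Delta_A \,.
\]

The first inequality is by our assumption and the second one holds for any solution $A$. ∎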
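The construction in this proof can be sketched as a short procedure that scans the prefixes of a solution and returns the first $k$ whose interval $[f(B_k, \bar{w}), p_{B_k}]$ contains $\frac{1}{2} f^*$. The function name and the example weights are hypothetical. Note that the procedure needs the expected weights, so it is a device of the analysis and not a step of the learning algorithm.

```python
from math import prod


def choose_prefix(means, f_star):
    """Return k such that the prefix of the first k items satisfies both claims.

    means[i] is the expected weight of the (i + 1)-th item of the solution A.
    """
    if prod(means) >= 0.5 * f_star:
        return len(means)  # first case: the whole solution works
    for k in range(1, len(means) + 1):
        f_prefix = prod(means[:k])      # f(B_k, w_bar)
        p_prefix = prod(means[:k - 1])  # p_{B_k}, equals 1 for k = 1
        if f_prefix <= 0.5 * f_star <= p_prefix:
            return k
    raise ValueError("no valid prefix; check that f_star is at most 1")


# Hypothetical solution with four items and optimal value f* = 0.6.
k = choose_prefix([0.9, 0.8, 0.3, 0.9], f_star=0.6)
```

In this example $k = 3$: the prefix has $p_{B_3} = 0.72 \geq 0.3$ and gap $0.6 - 0.216 = 0.384$, which is at least half of $\Delta_A = 0.6 - 0.1944 = 0.4056$.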

Lemma 2. Let $0 \leq p_1, \dots, p_K \leq 1$ and $u_1, \dots, u_K \geq 0$. Then:

\[
\prod_{k=1}^K \min\{p_k + u_k, 1\} \leq \prod_{k=1}^K p_k + \sum_{k=1}^K u_k \,.
\]

This bound is tight when $p_1, \dots, p_K = 1$ and $u_1, \dots, u_K = 0$.
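As a quick numerical illustration of Lemma 2, the sketch below compares both sides of the inequality on randomly drawn inputs. The function names, the ranges of $p_k$ and $u_k$, and the number of trials are arbitrary choices made for this example.

```python
import random
from math import prod


def lemma2_sides(p, u):
    """Return (left, right) sides of the product inequality for lists p and u."""
    left = prod(min(pk + uk, 1.0) for pk, uk in zip(p, u))
    right = prod(p) + sum(u)
    return left, right


def check_random(trials=10000, seed=0):
    """Return True if left <= right (up to rounding) on all random trials."""
    rng = random.Random(seed)
    for _ in range(trials):
        k = rng.randint(1, 8)
        p = [rng.random() for _ in range(k)]
        u = [rng.uniform(0.0, 0.5) for _ in range(k)]
        left, right = lemma2_sides(p, u)
        if left > right + 1e-12:
            return False
    return True
```

At $p_k = 1$ and $u_k = 0$ both sides equal one, which matches the tightness claim.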


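Proof. The proof is by induction on $K$. Our claim clearly holds when $K = 1$. Now choose $K > 1$
and suppose that our claim holds for any $0 \leq p_1, \dots, p_{K-1} \leq 1$ and
$u_1, \dots, u_{K-1} \geq 0$. Then:

\begin{align*}
\prod_{k=1}^K \min\{p_k + u_k, 1\}
& = \min\{p_K + u_K, 1\} \prod_{k=1}^{K-1} \min\{p_k + u_k, 1\} \\
& \leq \min\{p_K + u_K, 1\} \left(\prod_{k=1}^{K-1} p_k + \sum_{k=1}^{K-1} u_k\right) \\
& \leq p_K \prod_{k=1}^{K-1} p_k + u_K \underbrace{\prod_{k=1}^{K-1} p_k}_{\leq 1}
+ \underbrace{\min\{p_K + u_K, 1\}}_{\leq 1} \sum_{k=1}^{K-1} u_k \\
& \leq \prod_{k=1}^K p_k + \sum_{k=1}^K u_k \,. \qquad ∎
\end{align*}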